Add DTLS throughput benchmark tool and optimize send path#10551
Add DTLS throughput benchmark tool and optimize send path#10551julek-wolfssl wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds a new DTLS throughput benchmark under examples/benchmark/ and makes two optimizations in the DTLS send path to better measure (and reduce) per-record overhead in wolfSSL’s record layer and socket I/O glue.
Changes:
- Add
examples/benchmark/dtls_bench.c: a DTLS 1.2/1.3 throughput benchmark with cipher selection, plain-UDP baseline mode, and a client-side “sink send” mode. - Optimize DTLS send path by caching the
SO_TYPE(datagram vs stream) probe inWOLFSSL_DTLS_CTXinstead of callinggetsockopt()on every send. - Optimize AEAD explicit-nonce construction by writing the record sequence number directly for suites where the explicit nonce is defined as the seq number, using a new read-only
PeekSEQ()helper.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
wolfssl/internal.h |
Adds DTLS context fields for caching socket type probe results. |
tests/api.c |
Resets new DTLS context cache fields when copying SSL state in an API test helper. |
src/wolfio.c |
Changes datagram-vs-stream detection to cache SO_TYPE results. |
src/ssl.c |
Invalidates the DTLS socket-type cache when read/write fds are (re)assigned. |
src/internal.c |
Adds PeekSEQ() and uses it to derive AEAD explicit nonce from sequence number for applicable suites. |
examples/benchmark/include.am |
Adds dtls_bench to Automake build outputs. |
examples/benchmark/dtls_bench.c |
New DTLS benchmark tool implementation. |
.gitignore |
Ignores the built examples/benchmark/dtls_bench binary. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
8d445f1 to
32c7f0b
Compare
|
retest this please |
7b5387d to
4068636
Compare
|
4068636 to
0d93481
Compare
|
retest this please. |
0d93481 to
0c90533
Compare
Add examples/benchmark/dtls_bench, a DTLS throughput benchmark that completes a handshake and then measures bulk-send throughput. It supports DTLS 1.2 and 1.3, selectable cipher suites, an end-to-end mode, and a -z sink mode that discards records on the server after the handshake to isolate the sender's record-layer cost. The socket is set up with wolfSSL_set_dtls_fd_connected. Optimize the send path exercised by the benchmark: - wolfio (EmbedSendTo): cache the per-descriptor socket-type probe (getsockopt SO_TYPE) in WOLFSSL_DTLS_CTX instead of running it on every send, removing a syscall from the record send path. The cache is invalidated whenever rfd/wfd is reassigned. - internal (BuildMessage): for AEAD suites whose explicit nonce is the 8-byte record sequence number, write the sequence number directly as nonce_explicit instead of drawing it from the RNG. This covers AES-GCM (RFC 5288 sec 3), AES-CCM (RFC 6655 sec 3), SM4-GCM/CCM (RFC 8998 sec 3), and Camellia-/ARIA-GCM which inherit the RFC 5288 construction; ChaCha20 uses an implicit nonce and is excluded. A new read-only PeekSEQ() helper reads the sequence number without advancing the per-direction counter, leaving the single mandated increment to writeAeadAuthData(). Also ignore the built dtls_bench binary in .gitignore.
0c90533 to
1e7d632
Compare
|
The jenkins failures are not related to this PR. |
dtls_bench.c is built whenever DTLS and the example servers are enabled, including the cross-mingw-all-crypto multi-test scenario, which cross- compiles for Windows. It directly includes POSIX-only headers (<sys/socket.h>, <arpa/inet.h>, <netdb.h>, <net/if.h>) that mingw does not ship, so the build failed there. Gate the networking includes and the whole benchmark body behind a DTLS_BENCH_ENABLED check (WOLFSSL_DTLS, not USE_WINDOWS_API, not WOLFSSL_NO_SOCK). When the platform lacks POSIX BSD sockets, compile a small stub main() that reports the tool is unsupported, so the source tree still builds.
1e7d632 to
cae9827
Compare
dgarske
left a comment
There was a problem hiding this comment.
Skoll Code Review
Scan type: reviewOverall recommendation: COMMENT
Findings: 5 total — 5 posted, 0 skipped
5 finding(s) posted as inline comments (see file-level comments below)
Posted findings
- [Medium] isDGram cache write is an unsynchronized data race under WOLFSSL_RW_THREADED —
src/wolfio.c:655-666 - [Low] Single isDGram cache shared between rfd and wfd of potentially different socket types —
src/wolfio.c:649-667 - [Low] Confirm discarded explicit-nonce value is intentional across FIPS and epoch-order paths —
src/internal.c:24809-24831 - [Low] now_sec ignores clock_gettime failure, can return uninitialized time —
examples/benchmark/dtls_bench.c:96-101 - [Low] -z (sink send) silently ignored when combined with -s (server) —
examples/benchmark/dtls_bench.c:parse_args
Review generated by Skoll
| /* optvalue 'type' is of size int */ | ||
| XSOCKLENT length = (XSOCKLENT)sizeof(type); | ||
|
|
||
| if (dtlsCtx->isDGramCached) |
There was a problem hiding this comment.
🟠 [Medium] isDGram cache write is an unsynchronized data race under WOLFSSL_RW_THREADED
Previously isDGramSock() was a pure read of the socket type. It now lazily writes dtlsCtx->isDGram and dtlsCtx->isDGramCached (bit-fields that share a storage unit with connected, and with processingPendingRecord when WOLFSSL_DTLS_CID is set). In a WOLFSSL_RW_THREADED build, EmbedSendTo (write thread) calls this on every send and EmbedReceiveFrom (read thread) calls it on the recvd == 0 path, so two threads can perform the first-time write concurrently. Writing adjacent bit-fields is a read-modify-write of the shared byte and is not atomic, so this is a data race (UB per the C memory model) on dtlsCtx. In practice it is benign because both threads write the same socket-type value and connected is not modified concurrently, but it can be reported by ThreadSanitizer CI and is a new race introduced by moving shared-state mutation into the hot I/O path. The rest of dtlsCtx peer state is already protected by peerLock precisely because these callbacks run concurrently.
Fix: Confirm intent for threaded builds; prefer computing the cache eagerly at fd assignment to remove the race from the data path entirely.
| * fixed for its lifetime, so the getsockopt(SO_TYPE) probe runs once per | ||
| * connection and is cached in dtlsCtx; the cache is invalidated wherever | ||
| * rfd/wfd is reassigned. */ | ||
| static int isDGramSock(WOLFSSL_DTLS_CTX* dtlsCtx, int sfd) |
There was a problem hiding this comment.
🔵 [Low] Single isDGram cache shared between rfd and wfd of potentially different socket types
The cache stores one isDGram flag per dtlsCtx, but it is consulted for both the read descriptor (dtlsCtx->rfd, via EmbedReceiveFrom) and the write descriptor (dtlsCtx->wfd, via EmbedSendTo). If an application sets rfd and wfd to different sockets of different types (legal via wolfSSL_set_read_fd/wolfSSL_set_write_fd), the first probe wins and the cached value is then incorrectly applied to the other descriptor. This is unusual for DTLS (rfd == wfd normally) and the cache is invalidated on each fd reassignment, so the common path is fine, but the original code re-probed the exact sfd passed in each time, whereas the cache is now keyed on the context rather than the descriptor.
Fix: Confirm the rfd==wfd (or same-type) assumption holds for all supported DTLS usages; document it next to the cache fields.
| ret = wc_RNG_GenerateBlock(ssl->rng, args->iv, args->ivSz); | ||
| if (ret != 0) | ||
| goto exit_buildmsg; | ||
| #ifdef HAVE_AEAD |
There was a problem hiding this comment.
🔵 [Low] Confirm discarded explicit-nonce value is intentional across FIPS and epoch-order paths
The new code writes the record sequence number into the explicit nonce via PeekSEQ(ssl, epochOrder, args->iv) instead of wc_RNG_GenerateBlock. In every standard configuration this written value is subsequently overwritten before transmission: the non-FIPS/FIPSv2+ path copies the cipher-generated IV over it (XMEMCPY(out, ssl->encrypt.nonce + AESGCM_IMP_IV_SZ, ...) at line ~20741), and the legacy-FIPS path overwrites args->iv with ssl->keys.aead_exp_IV (line ~24839). So the real effect of this change is removing the per-record RNG draw; the seq value itself is intentionally discarded. Note also that the actual AAD seq is written with CUR_ORDER inside writeAeadAuthData (line ~20695) while PeekSEQ here uses epochOrder, so the two would differ for DTLS SCR PREV_ORDER retransmits, but this is harmless precisely because the PeekSEQ-written bytes are discarded. This is correct but subtle; a future reader could mistakenly assume the on-wire explicit nonce equals PeekSEQ.
Fix: Keep the change; add a short clarifying comment to prevent future misreading and confirm the RNG-removal is the intended win.
| #define DEFAULT_SOCKBUF (8 * 1024 * 1024) | ||
|
|
||
| /* Upper bound on per-record overhead the user must leave under MTU. | ||
| * DTLS 1.2 worst case: 13 (hdr) + 8 (explicit nonce) + 16 (AEAD tag) = 37. |
There was a problem hiding this comment.
🔵 [Low] now_sec ignores clock_gettime failure, can return uninitialized time
clock_gettime()'s return value is cast to void and ts is not initialized, so on the (rare) failure of CLOCK_MONOTONIC the function returns garbage from the stack, corrupting all throughput/duration math. This is benchmark-only code so impact is limited, but it is trivial to harden.
Fix: Initialize ts and check the return value, returning a sentinel on failure.
| break; | ||
| case 'b': c->sockBuf = atoi(optarg); break; | ||
| case 'n': c->plainUdp = 1; break; | ||
| case 'z': c->sinkSend = 1; break; |
There was a problem hiding this comment.
🔵 [Low] -z (sink send) silently ignored when combined with -s (server)
The help text documents -z as client-only, but parse_args performs no validation when -z is passed together with -s. In server mode sinkSend is parsed and then never used, so a user invoking -s -z gets no warning that the flag had no effect. Minor usability nit in example code.
Fix: Emit a warning (or error) when -z is combined with -s so the no-op is visible to the user.
Add examples/benchmark/dtls_bench, a DTLS throughput benchmark that completes a handshake and then measures bulk-send throughput. It supports DTLS 1.2 and 1.3, selectable cipher suites, an end-to-end mode, and a -z sink mode that discards records on the server after the handshake to isolate the sender's record-layer cost. The socket is set up with wolfSSL_set_dtls_fd_connected.
Optimize the send path exercised by the benchmark:
wolfio (EmbedSendTo): cache the per-descriptor socket-type probe (getsockopt SO_TYPE) in WOLFSSL_DTLS_CTX instead of running it on every send, removing a syscall from the record send path. The cache is invalidated whenever rfd/wfd is reassigned.
internal (BuildMessage): for AEAD suites whose explicit nonce is the 8-byte record sequence number, write the sequence number directly as nonce_explicit instead of drawing it from the RNG. This covers AES-GCM (RFC 5288 sec 3), AES-CCM (RFC 6655 sec 3), SM4-GCM/CCM (RFC 8998 sec 3), and Camellia-/ARIA-GCM which inherit the RFC 5288 construction; ChaCha20 uses an implicit nonce and is excluded. A new read-only PeekSEQ() helper reads the sequence number without advancing the per-direction counter, leaving the single mandated increment to writeAeadAuthData().
Also ignore the built dtls_bench binary in .gitignore.